Gang termination fix #353
Conversation
shayasoolin left a comment
Great job!
```go
allErrs = append(allErrs, field.Forbidden(fldPath.Index(i).Child("terminationDelay"), "terminationDelay can only be set on PodCliqueScalingGroupConfig when PodCliqueSetTemplateSpec.terminationDelay is set (gang termination is enabled)"))
} else if scalingGroupConfig.TerminationDelay.Duration <= 0 {
	allErrs = append(allErrs, field.Invalid(fldPath.Index(i).Child("terminationDelay"), scalingGroupConfig.TerminationDelay, "terminationDelay must be greater than 0"))
}
```
and another else-if case, to validate that the PCSG-specific termination delay is not greater than the PCS one.
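For illustration, a minimal sketch of the full rule set with the suggested extra branch, pulled out as a standalone helper; the helper name and parameters are hypothetical, and only the `field.Forbidden`/`field.Invalid` calls and messages mirror the quoted validation:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateScalingGroupTerminationDelay is a hypothetical helper sketching the
// rules discussed here: a PCSG-level override requires a PCS-level delay,
// must be positive, and (per the suggestion above) must not exceed the
// PCS-level delay.
func validateScalingGroupTerminationDelay(pcsgDelay, pcsDelay *metav1.Duration, fldPath *field.Path) field.ErrorList {
	var allErrs field.ErrorList
	if pcsgDelay == nil {
		return allErrs
	}
	switch {
	case pcsDelay == nil:
		allErrs = append(allErrs, field.Forbidden(fldPath, "terminationDelay can only be set when PodCliqueSetTemplateSpec.terminationDelay is set (gang termination is enabled)"))
	case pcsgDelay.Duration <= 0:
		allErrs = append(allErrs, field.Invalid(fldPath, pcsgDelay, "terminationDelay must be greater than 0"))
	case pcsgDelay.Duration > pcsDelay.Duration:
		// the additional else-if case suggested in this thread
		allErrs = append(allErrs, field.Invalid(fldPath, pcsgDelay, "terminationDelay must not be greater than PodCliqueSetTemplateSpec.terminationDelay"))
	}
	return allErrs
}
```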
See https://docs.google.com/document/d/11RvFMS1l5RH_FY54G6wd1Q0wSJ2RCobiPRAs2vAhzgM/edit?disco=AAAByqL0jIQ. I discussed with @nvrohanv and @athreesh and while I don't see why someone would do this, we're not going to babysit this.
I agree with @shayasoolin. The PCSG termination delay should not be more than the PCS one. We need to provide an API where results are deterministic.
Yes, the reason I don't think we should babysit this is that we assume anyone who opts into gang termination knows what they are doing. In that scenario I can understand someone being more tolerant at the PCSG level: if one of 3 PCSGs is down, they might give it a longer amount of time to recover, whereas if the PCS minAvailable is breached they have less tolerance because the system is not functional. Since there's a valid use case for setting it higher, and we assume this is a power-user feature, I think we should provide full flexibility.
LGTM!
Good question, I've just updated the release notes section in the PR description with some guidance. I believe this will be included in the release notes when we cut a release? CC: @sanjaychatterjee.
Maybe? I'll take a stab at adding something. I believe @nvrohanv is working on a user guide PR; I might just wait for that to merge and then follow the style/approach there.
I will review it this week.
Force-pushed from b04bbb2 to 0c07d49.
```go
// TerminationDelay overrides the PodCliqueSet-level terminationDelay for this scaling group.
// Can only be set if PodCliqueSetTemplateSpec.TerminationDelay is set (gang termination is enabled).
// When set, this value is used instead of the PodCliqueSet-level terminationDelay for gang termination
// decisions affecting this scaling group's replicas.
```
Do we need validation that this value should be less than or equal to the PCS one?
```diff
 // for each PCSG (PCSG override if set, otherwise PCS default).
 // It returns the names of all such PodCliqueScalingGroups and minimum of all the waitDurations.
-func getMinAvailableBreachedPCSGInfo(pcsgs []grovecorev1alpha1.PodCliqueScalingGroup, terminationDelay time.Duration, since time.Time) ([]string, time.Duration) {
+func getMinAvailableBreachedPCSGInfoWithEffectiveDelay(pcsgs []grovecorev1alpha1.PodCliqueScalingGroup, pcs *grovecorev1alpha1.PodCliqueSet, since time.Time) ([]string, time.Duration) {
```
Can we add a unit test?
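For context, the behavior such a test would pin down is the effective-delay resolution. A hedged sketch of that lookup, where the helper name and field paths (`Spec.Template.TerminationDelay`, config-level `TerminationDelay`) are assumptions based on identifiers quoted in this PR:

```go
import (
	"time"

	grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed
)

// effectiveTerminationDelay resolves the delay used for a given PCSG: the
// PCSG-level override when one is set on its matching config, otherwise the
// PCS-level default. Validation is assumed to have already rejected a nil
// PCS-level delay (gang termination disabled) before this is called.
func effectiveTerminationDelay(pcs *grovecorev1alpha1.PodCliqueSet, pcsg *grovecorev1alpha1.PodCliqueScalingGroup) time.Duration {
	if cfg := findMatchingPCSGConfig(pcs, pcsg); cfg != nil && cfg.TerminationDelay != nil {
		return cfg.TerminationDelay.Duration
	}
	return pcs.Spec.Template.TerminationDelay.Duration
}
```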
```go
// findMatchingPCSGConfig finds the PodCliqueScalingGroupConfig that matches the given PCSG.
// It returns nil if no matching config is found.
func findMatchingPCSGConfig(pcs *grovecorev1alpha1.PodCliqueSet, pcsg *grovecorev1alpha1.PodCliqueScalingGroup) *grovecorev1alpha1.PodCliqueScalingGroupConfig {
```
Can we add a unit test?
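A rough table-driven test sketch; matching configs by name and the `Spec.Template.PodCliqueScalingGroupConfigs` path are assumptions about the actual API types, so treat this as a starting point rather than the real test:

```go
import (
	"testing"

	grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed
)

func TestFindMatchingPCSGConfig(t *testing.T) {
	pcs := &grovecorev1alpha1.PodCliqueSet{}
	// Assumed field path and match rule; the real spec layout may differ.
	pcs.Spec.Template.PodCliqueScalingGroupConfigs = []grovecorev1alpha1.PodCliqueScalingGroupConfig{
		{Name: "prefill"},
		{Name: "decode"},
	}
	tests := []struct {
		pcsgName string
		wantNil  bool
	}{
		{pcsgName: "decode", wantNil: false},
		{pcsgName: "router", wantNil: true}, // no matching config entry
	}
	for _, tc := range tests {
		pcsg := &grovecorev1alpha1.PodCliqueScalingGroup{}
		pcsg.Name = tc.pcsgName
		if got := findMatchingPCSGConfig(pcs, pcsg); (got == nil) != tc.wantNil {
			t.Errorf("findMatchingPCSGConfig(%q) = %v, wantNil = %v", tc.pcsgName, got, tc.wantNil)
		}
	}
}
```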
```go
// This field is used for gang termination: a PodClique can only be considered in breach of
// MinAvailable if it was previously available (WasAvailable=true).
// +kubebuilder:default=false
WasAvailable bool `json:"wasAvailable"`
```
Do we need this in the API? Wouldn't an update event tell us if the old version of the object was available but the new version is not?
I mentioned this in the "GREP":
Motivation for WasAvailable:
Ignore Initial Startup: New PodCliques should not be terminated simply because they haven't yet reached their MinAvailable threshold. The WasAvailable gate ensures gang termination only triggers for workloads that were previously healthy and then became degraded—not for workloads that are still starting up.
Operator Crash Resilience: The WasAvailable flag is persisted in the PodClique's status (not held in memory). This ensures that if the Grove operator crashes and restarts during the transition to availability, the flag is not lost. Because it's a sticky bit that never reverts to `false`, the operator can safely restart at any time without missing the transition or incorrectly treating a previously-available PodClique as never-available.
I'll go into a bit more detail in the doc, but we do need to persist this information for a few reasons:
• If the operator crashes and restarts, we can't reliably infer whether a PodClique was ever available just from the conditions. The UpdateInProgress reason doesn't tell us either way, so we'd be guessing.
• There's also a race condition where pods could briefly reach MinAvailable then die before we reconcile. Without the sticky bit, we'd lose that info and wrongly treat it as "never available."
I also considered just adding another condition, but decided against it: using a status field for internal bookkeeping (like observedGeneration) follows the Kubernetes pattern/convention, whereas conditions are more for user-facing state that users would alert on. So ultimately I went with the separate status boolean.
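For concreteness, a minimal sketch of how such a sticky bit would be maintained during status reconciliation; the helper name and parameters are hypothetical, and only `Status.WasAvailable` comes from this PR:

```go
import grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed

// updateWasAvailable flips the sticky bit the first time the PodClique
// reaches MinAvailable and never clears it afterwards, so the fact survives
// operator restarts via the persisted status.
func updateWasAvailable(pclq *grovecorev1alpha1.PodClique, readyReplicas, minAvailable int32) {
	if !pclq.Status.WasAvailable && readyReplicas >= minAvailable {
		pclq.Status.WasAvailable = true // sticky: never reverts to false
	}
}
```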
Update - After going through the edge case below, I do see that if we respect the previous MinAvailableBreached=True on operator restart (and perhaps all the time) and don't require a "FLIP", it would solve this edge case. I'll revive the implementation and see if other additional edge cases pop up. That said, WasAvailable definitely has advantages in terms of simplicity and being explicitly observable.
For more background, here's the initial approach I tried and abandoned:
The PodClique controller watches "owned" Pods using a predicate. This triggers PodClique controller reconciliation when pods are deleted or have status changes (specifically when the Ready condition transitions). During this reconciliation, the "old" (current, but to-be-reconciled) PodClique object is retrieved from the Kubernetes API. We then reconcile and determine the new values for status fields like PodClique.Status.ReadyReplicas. At this point we essentially have an "old" and a "new" PodClique object, allowing us to notice changes in PodClique.Status.ReadyReplicas. Using this approach, we could avoid setting MinAvailableBreached to true unless oldPodClique.Status.ReadyReplicas >= minAvailable in the "old" object. However, this is insufficient given our requirements. Consider the following edge case:
Common Starting Point
| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|---|---|---|---|---|
| T0 | PCLQ created, 0 pods | False | InsufficientScheduledPods | false |
| T1 | 3 pods scheduled, starting | False | InsufficientScheduledPods | false |
At T1, both scenarios are identical: 3 pods scheduled and starting.
Scenario A: Workload Becomes Available Then Degrades
| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|---|---|---|---|---|
| T2 | All 3 pods become ready | False | SufficientReadyPods | true |
| T3 | 1 pod dies (2/3 ready) | True | InsufficientReadyPods | true |
| T4 | Operator crashes | - | - | - |
| T5 | Operator restarts | True | InsufficientReadyPods | ? |
Should gang terminate? YES - workload was healthy, now degraded.
Scenario B: Workload Never Becomes Available
| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|---|---|---|---|---|
| T2 | 1 pod crashes during init | True | InsufficientReadyPods | false |
| T3 | Operator crashes | - | - | - |
| T4 | Operator restarts | True | InsufficientReadyPods | ? |
Should gang terminate? NO - workload never achieved availability.
State After Restart Comparison
| Field | Scenario A (T5) | Scenario B (T4) |
|---|---|---|
| Status | True | True |
| Reason | InsufficientReadyPods | InsufficientReadyPods |
| ScheduledReplicas | 3 | 3 |
| ReadyReplicas | 2 | 2 |
| Should gang terminate? | YES | NO |
The condition and status fields are identical. Without WasAvailable, the operator cannot distinguish between these scenarios.
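Illustratively, the restart-safe gate the comparison motivates might look like this (a sketch; the function name is hypothetical):

```go
import grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed

// shouldGangTerminate only counts a MinAvailable breach when the PodClique
// was previously available; after an operator restart this is exactly what
// separates Scenario A (terminate) from Scenario B (still starting up).
func shouldGangTerminate(pclq *grovecorev1alpha1.PodClique, minAvailableBreached bool) bool {
	return minAvailableBreached && pclq.Status.WasAvailable
}
```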
Thanks for the release notes. Additionally, we need a GREP and usage docs as well.
Oh, I included gang-termination.md as a GREP and used the MNNVL doc as a template. I can take a look at the template PR and follow that layout, sure.
We do need the documentation, but I can take the action item of adding that once this is merged in. In general I think the flow should be
unmarshall left a comment
1/n reviews
```diff
@@ -0,0 +1,375 @@
+# Gang Termination Design Doc
```
The designs directory is going to be removed. Please rework this document using the GREP template instead, and put the document under proposals. Before writing a GREP, please also read docs/proposals/README.md.
Maybe you can just move this document out of this PR. Let's create a single document for gang scheduling and termination. This doc as it stands today needs a bit of work, else it will hold up your fix for no good reason.
> # Overview
>
> Gang termination is a mechanism in Grove that ensures when a component becomes unhealthy (falls below minimum availability threshold), the entire group (gang) is terminated rather than leaving it in a degraded state. This is particularly important for distributed AI inference workloads where partial availability often means the workload cannot function properly.
Can you define a component?
Ideally one should introduce what a Gang is, then talk about gang termination.
> Gang termination is a mechanism in Grove
This needs to change as well, as it is not a new concept introduced in Grove but a commonly known concept in gang-scheduled workloads.
> Gang termination is a mechanism in Grove that ensures when a component becomes unhealthy (falls below minimum availability threshold), the entire group (gang) is terminated rather than leaving it in a degraded state. This is particularly important for distributed AI inference workloads where partial availability often means the workload cannot function properly.
>
> ## Abbreviations
If we are going to have this section, then we will have to repeat it for every proposal that is written, which I think is quite unnecessary. Instead, create a separate document for commonly used abbreviations that all documents can refer to.
> | Abbreviation | Full Name | Description |
> |--------------|-----------|-------------|
> | PCS | PodCliqueSet | Grove CRD that manages a set of PodCliques and PodCliqueScalingGroups |
PCS does not manage PCLQ and PCSG; it is the controllers within Grove that manage these resources.
> | Abbreviation | Full Name | Description |
> |--------------|-----------|-------------|
> | PCS | PodCliqueSet | Grove CRD that manages a set of PodCliques and PodCliqueScalingGroups |
> | PCLQ | PodClique | Grove CRD representing a group of related pods |
PCLQ is a group of pods that share the same PodSpecTemplate and serve a single purpose.
> - **Configurable Grace Period:** Allow time for Kubernetes scheduler to recover before terminating
> - **Startup Protection:** Avoid terminating workloads that haven't yet reached full availability
> - **Rolling Update Safety:** Prevent false terminations during expected pod churn
> - **Multi-Level Granularity:** Support gang termination at PCLQ, PCSG, and PCS levels
What is gang termination at PCS level?
Gang termination at the PCS level is the whole PCS replica getting kicked back to scheduling.
> **Limitations:**
>
> - **Ready Pods Only (PCLQ level):** Only ready pods count toward availability—starting pods are not considered
how is this a limitation?
Agreed, this seems like a feature. Also, it's only once-ready pods, right?
> - **Ready Pods Only (PCLQ level):** Only ready pods count toward availability—starting pods are not considered
> - **Sticky WasAvailable:** Once a PodClique has reached MinAvailable, the WasAvailable flag never reverts, even if the workload is later degraded and recovers
> - **No Partial Recovery:** When gang termination triggers, the affected replica (PCS or PCSG) is deleted in its entirety, healthy PodCliques within that replica are not preserved.
how is this a limitation?
Also agreed; I believe this would only happen if minAvailable is breached, at which point we want the whole thing triggered.
> Individual `PodCliqueScalingGroupConfig` entries can override the PCS-level `terminationDelay`:
>
> - If PCS-level `terminationDelay` is nil, gang termination is disabled for the entire PCS
> - If PCS-level `terminationDelay` is set, each PCSG can optionally override it with its own delay
Why did we choose this behavior? If we have TerminationDelay at the PCS and PCSG levels, then it should be respected when defined and termination delay is enabled. The issue is that a nil value of the PCS termination delay has been taken as an indication of whether this feature is enabled or disabled. This IMHO is not so nice and also not very intuitive.
From an API design perspective, when we define a delay at multiple levels, thus allowing overriding, then if the PCS does not define TerminationDelay but it is defined at the PCSG level, it should be honored as long as this feature is enabled.
Hmm, it seems a bit odd to do it that way as well: since minAvailable defaults to 1 (if I remember correctly), it seems like you should be getting gang semantics throughout. If anything, I would lean towards validating that you can't set it on just a PCSG. Something we should discuss more.
For PCLQ, if minAvailable is not set then it is defaulted to the PCLQ template replicas. For PCSG it defaults to 1. Gang semantics (scheduling and termination) apply to identified pod-gangs. Currently there are only 2 types of PodGangs:

**Base PodGang**

Comprises of:

- `minAvailable` replicas of stand-alone PCLQs
- `minAvailable` replicas of PCSG. Within a PCSG, if a constituent PCLQ defines `minAvailable`, then only those many pods.

This means that only these many pods are required for a functional application. If the number goes below that and stays like that for terminationDelay seconds, then it's time to terminate the gang.

**Scaled PodGang**

Similar behavior applies to the scaled PodGang, where only `minAvailable` of constituent PCLQs are considered to determine if the minAvailableBreached condition is true.
Now what happens when you do not define TerminationDelay? I believe we then need to have a default (very large) termination delay.
Consider the following PCS composition:
- Standalone PCLQ - router
- PCSG - comprises of decode & prefill PCLQs
Case #1
You define a terminationDelay of 1hr on PCS and 2hr on PCSG.
Now what does it mean for the base PodGang?
If router has minAvailableBreached set to true and it has already exceeded 1hr then it will gang terminate the base PodGang. A higher termination delay on PCSG would not come into play here.
What is the behavior for the scaled PodGang?
This is relatively simple. PCS termination delay will not come into effect. For all Scaled podgangs termination delay defined at the PCSG will only be considered.
Case #2
You define a terminationDelay of 2hr on PCS and 1hr on PCSG.
Now what does it mean for the base PodGang?
For base pod gang only PCS termination delay will apply. So if the PCSG PCLQs have their minAvailableBreached condition set for more than 1hr but less than 2hr, there will not be any gang termination. From the API perspective this behavior is utterly confusing.
So as soon as we introduce 2 level termination delay we should consider the behavior carefully.
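To make the described semantics concrete, a sketch of the delay selection per PodGang type as laid out in the two cases above (names are illustrative, not actual Grove code):

```go
import "time"

// terminationDelayFor encodes the behavior described above: the base PodGang
// is governed solely by the PCS-level delay, while each scaled PodGang
// considers only the delay defined on its PCSG.
func terminationDelayFor(isBasePodGang bool, pcsDelay, pcsgDelay time.Duration) time.Duration {
	if isBasePodGang {
		return pcsDelay // a higher or lower PCSG delay does not come into play here
	}
	return pcsgDelay
}
```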
unmarshall left a comment
2/n review comments
Co-authored-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Geoff Flarity <geoff.flarity@gmail.com>
What type of PR is this?
/kind feature
/kind bug
/kind api
What this PR does / why we need it:
This PR fixes and improves gang termination behavior in Grove with several key changes:

1. **Gang termination is now opt-in** - `terminationDelay` no longer defaults to 4 hours. When nil, gang termination is disabled entirely for the PodCliqueSet.
2. **Only ready pods count for breach detection** - Previously, breach calculation considered scheduled pods; now it only counts ready pods, which more accurately reflects actual workload availability.
3. **PCSG-level terminationDelay override** - Individual PodCliqueScalingGroups can now override the PCS-level termination delay, allowing finer-grained control over gang termination timing.
4. **Bug fix: nil pointer when base gang doesn't exist** - Fixed a crash when the base gang no longer exists after gang termination.
Which issue(s) this PR fixes:
Fixes #277
Special notes for your reviewer:
- `docs/designs/gang-termination.md` provides a high-level overview of the gang termination feature after these changes. Happy to update the code alongside any design doc feedback.
- PCSG-level `terminationDelay` can only be set when the PCS-level `terminationDelay` is set.

Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: